myHadoop - Hadoop-on-Demand on Traditional HPC Resources
Authors
Abstract
Traditional High Performance Computing (HPC) resources, such as those available on the TeraGrid, support batch job submissions using Distributed Resource Management Systems (DRMS) like TORQUE or the Sun Grid Engine (SGE). For large-scale data-intensive computing, programming paradigms such as MapReduce are becoming popular. A growing number of codes in scientific domains such as Bioinformatics and Geosciences are being written using open source MapReduce tools such as Apache Hadoop. It has proven challenging for Hadoop to co-exist with existing HPC resource management systems, since both provide their own job submission and management, and each system is designed to have complete control over its resources. Furthermore, Hadoop uses a shared-nothing architecture, whereas most HPC resources employ a shared-disk setup. In this paper, we describe myHadoop, a framework for configuring Hadoop on-demand on traditional HPC resources using standard batch scheduling systems. With myHadoop, users can develop and run Hadoop codes on HPC resources without requiring root-level privileges. Here, we describe the architecture of myHadoop, and evaluate its performance for a few sample scientific use-case scenarios. myHadoop is open source, and available for download on SourceForge.
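To make the on-demand idea concrete, the sketch below shows what a TORQUE batch job wrapping a transient Hadoop cluster might look like. This is an illustrative outline, not myHadoop's exact interface: the helper script names (`myhadoop-configure.sh`, `myhadoop-cleanup.sh`), paths, and flags are assumptions standing in for whatever a given site provides. The pattern is the point: configure a per-job Hadoop cluster from the scheduler's node allocation, run the MapReduce job, then tear everything down before the allocation ends.

```shell
#!/bin/bash
#PBS -N hadoop-wordcount
#PBS -l nodes=4:ppn=8
#PBS -l walltime=01:00:00

# Illustrative paths -- site-specific; MY_HADOOP_HOME and the helper
# script names below are assumptions, not a fixed API.
export MY_HADOOP_HOME=$HOME/myhadoop
export HADOOP_CONF_DIR=$HOME/hadoop-conf.$PBS_JOBID

# 1. Generate per-job Hadoop configuration from the scheduler's node
#    list, so the transient cluster uses only the nodes this job holds.
$MY_HADOOP_HOME/bin/myhadoop-configure.sh -n 4 -c $HADOOP_CONF_DIR

# 2. Start the HDFS and MapReduce daemons on the allocated nodes.
$HADOOP_HOME/bin/start-all.sh

# 3. Stage data into HDFS, run the job, stage results back out.
hadoop --config $HADOOP_CONF_DIR dfs -copyFromLocal input/ input/
hadoop --config $HADOOP_CONF_DIR jar wordcount.jar WordCount input output
hadoop --config $HADOOP_CONF_DIR dfs -copyToLocal output/ results/

# 4. Tear the cluster down so the allocation is returned clean.
$HADOOP_HOME/bin/stop-all.sh
$MY_HADOOP_HOME/bin/myhadoop-cleanup.sh
```

Because the Hadoop daemons live and die inside a single batch job, the DRMS remains the sole authority over the nodes, which is how the framework avoids the dual-resource-manager conflict described above.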
Similar resources
Pilot-Abstraction: A Valid Abstraction for Data-Intensive Applications on HPC, Hadoop and Cloud Infrastructures?
HPC environments have traditionally been designed to meet the compute demand of scientific applications, and data has only been a second-order concern. With science moving toward data-driven discoveries relying more and more on correlations in data to form scientific hypotheses, the limitations of existing HPC approaches become apparent: architectural paradigms such as the separation of storage ...
Big Data at HPC Wales
This paper describes an automated approach to handling Big Data workloads on HPC systems. We describe a solution that dynamically creates a unified cluster based on YARN in an HPC environment, without the need to configure and allocate a dedicated Hadoop cluster. The end user can choose to write the solution in any combination of supported frameworks, a solution that scales seamlessly from a fe...
Hadoop on a Low-Budget General Purpose HPC Cluster in Academia
In the last decade, we witnessed an increasing interest in High Performance Computing (HPC) infrastructures, which play an important role in both academic and industrial research projects. At the same time, due to the increasing amount of available data, we also witnessed the introduction of new frameworks and applications based on the MapReduce paradigm (e.g., Hadoop). Traditional HPC systems ...
Scalable Inverted Indexing on NoSQL Table Storage
The development of data-intensive problems in recent years has brought new requirements and challenges to storage and computing infrastructures. Researchers are not only doing batch loading and processing of large-scale data, but also demanding the capabilities of incremental updates and interactive analysis. Therefore, extending existing storage systems to handle these new requirements beco...
Cloud Solutions for High Performance Computing: Oxymoron or Realm?
In recent years, a strong interest in cloud computing has arisen within the HPC (High Performance Computing) community. There are many apparent benefits of doing HPC in a cloud, the most important being better utilization of computational resources, efficient charge-back of used resources and applications, and on-demand, dynamic reallocation of computational resources bet...
Journal:
Volume, issue:
Pages: -
Publication date: 2004